Abstract

We consider the problem of sequential decision making under uncer- 
tainty in which the loss caused by a decision depends on the following 
binary observation. In competitive on-line learning, the goal is to design 
decision algorithms that are almost as good as the best decision rules 
in a wide benchmark class, without making any assumptions about the 
way the observations are generated. However, standard algorithms in this

area can only deal with finite-dimensional (often countable) benchmark 
classes. In this paper we give similar results for decision rules ranging over 
an arbitrary reproducing kernel Hilbert space. For example, it is shown 
that for a wide class of loss functions (including the standard square, ab- 
solute, and log loss functions) the average loss of the master algorithm, 
over the first N observations, does not exceed the average loss of the best 
decision rule with a bounded norm plus $O(N^{-1/2})$. Our proof technique is
very different from the standard ones and is based on recent results about 
defensive forecasting. Given the probabilities produced by a defensive 
forecasting algorithm, which are known to be well calibrated and to have 
good resolution in the long run, we use the expected loss minimization 
principle to find a suitable decision. 



1 Introduction 



In the simple problem of sequential decision making that we consider in this 
paper, the loss $\lambda(y_n,\gamma_n)$ (possibly negative) caused by a decision $\gamma_n$ depends
only on the following binary observation $y_n$. All relevant information available
to the decision maker by the time he makes his decision is collected in what we
call the datum, $x_n$. For example, in time series applications the datum may
contain all or the most recent observations; in pattern recognition, where the 
observations are the true classes of patterns, the datum may be the vector of a 
pattern's attributes. 

The traditional approach to this problem assumes a statistical model for the
sequence of pairs $(x_n,y_n)$; e.g., statistical learning theory ([21]) assumes that
the $(x_n,y_n)$ are generated independently from the same probability distribution.
A more recent approach, known in learning theory as "prediction with
expert advice" (e.g., 0]) and in information theory as "universal prediction" 
(e.g., E3H3|), avoids making assumptions about the way the observations and 
data are generated. Instead, the goal of the decision maker is to compete with 
a more or less general benchmark class of decision rules, mapping the $x$s to the
$y$s (the framework of prediction with expert advice is usually even more general).
We will use the phrase "competitive on-line" to refer to this area (as in
[23], emphasizing similarities to competitive on-line algorithms in computation
theory).

The first papers on competitive on-line learning with general loss functions (e.g.,
[22]) dealt with countable (often finite) benchmark classes. The next step was
to consider finite-dimensional benchmark classes (e.g., (SJ^JIJSl). This paper 
continues with infinite-dimensional classes. (Such classes were considered earlier 
by Kimber and Long [11, 13], who, however, assumed that the benchmark class
contains a perfect decision rule.) To get an idea of our central results, the reader
is advised to start from Corollaries 1-3.

Our implicit assumption, common with other work in competitive on-line 
learning, is that the decision maker is "small": his decisions do not affect the 
future observations. This is not a mathematical assumption: as already men- 
tioned, we do not make any assumptions at all about the way observations are 
generated; however, interpretation of our results becomes problematic if the de- 
cision maker is not small. "Big" decision makers can still use our algorithms for 
prediction (cf. Remarks H an d El below) . 

To conclude this section, we briefly describe the content of the paper.
Our main result is stated in §2 and several examples are given in §3. It is proved
in §6; in §4 we describe the main ideas behind the proof and in §5 we prove
some preparatory results for §6. Our decision algorithm is explicitly described
in §7. We conclude with a short list of directions of further research. A
preliminary version of this paper is to appear as [24]. In this new version we
made the title of the paper more specific (the old title was even somewhat
misleading: in prediction with expert advice, the experts are usually completely
free in making their decisions).

2 Main result 

Our decision protocol is: 

FOR n = 1, 2, ...:
  Reality announces $x_n \in X$.
  Decision Maker announces $\gamma_n \in \Gamma$.
  Reality announces $y_n \in \{0,1\}$.
END FOR.

At each step (or round) $n$ Decision Maker makes a decision $\gamma_n$ whose consequences
depend on the observation $y_n \in \{0,1\}$ chosen by Reality. All relevant
information available to Decision Maker by the time he makes his decision is
collected in $x_n$, called the datum. We assume that the data $x_n$ are elements of
a data space $X$ and that the decisions are elements of a decision space $\Gamma$ (both
sets assumed non-empty).

Remark 1 In this paper we are interested, first of all, in prediction of future 
observations. However, our framework allows a fairly wide class of loss functions, 
not all of which can be interpreted in terms of predictions (such as, e.g., Cover's 
and long-short games, in the terminology of §2). This is the main reason 
why we prefer to talk about decision making in general; another reason is that 
in §5 we will deal with a very different kind of prediction (for which we reserve
the term "forecasting"). 

A decision strategy is a strategy for Decision Maker in this protocol (explicitly
defined specific strategies will also be called "decision algorithms"). Its
performance is measured with a loss function $\lambda : \{0,1\} \times \Gamma \to \mathbb{R}$, and so its
cumulative loss over the first $N$ rounds is

$$\sum_{n=1}^N \lambda(y_n,\gamma_n).$$

The pair $(\Gamma,\lambda)$ is the game being played. Decision Maker will compete against
a class $\mathcal{F}$, called the benchmark class, of functions $D : X \to \Gamma$ considered as
decision rules; the cumulative loss suffered by such a decision rule is

$$\sum_{n=1}^N \lambda(y_n,D(x_n)).$$

Before stating our main result we define some useful notions connected with
the two main components of our decision framework, the game $(\Gamma,\lambda)$ and the
benchmark class $\mathcal{F}$. The reader might want to read the next section in parallel;
it describes some important examples of games and benchmark classes.

Games 

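The exposure $\mathrm{Exp}_\lambda(\gamma) \in \mathbb{R}$ of a decision $\gamma \in \Gamma$ is

$$\mathrm{Exp}_\lambda(\gamma) := \lambda(1,\gamma) - \lambda(0,\gamma)$$

and the exposure $\mathrm{Exp}_{\lambda,D} : X \to \mathbb{R}$ of a decision rule $D$ at a point $x \in X$ is

$$\mathrm{Exp}_{\lambda,D}(x) := \lambda(1,D(x)) - \lambda(0,D(x)).$$

Let $\lambda(p,\gamma)$ be the expected loss caused by taking a decision $\gamma$ when the
probability of 1 is $p$:

$$\lambda(p,\gamma) := p\lambda(1,\gamma) + (1-p)\lambda(0,\gamma). \tag{1}$$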
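
The definitions above translate directly into code. The following is a minimal sketch, not part of the original development: the names `exposure`, `expected_loss` and `best_decisions` are hypothetical, the square loss $(y-\gamma)^2$ of the next section is used purely as an illustration, and the set of minimizers of (1) is only approximated on a finite grid of decisions.

```python
from typing import Callable, List

# A loss function takes an observation y in {0, 1} and a decision gamma.
Loss = Callable[[int, float], float]

def square_loss(y: int, gamma: float) -> float:
    """Illustrative loss: (y - gamma)^2 with gamma in [0, 1]."""
    return (y - gamma) ** 2

def exposure(loss: Loss, gamma: float) -> float:
    """Exposure of a decision: loss(1, gamma) - loss(0, gamma)."""
    return loss(1, gamma) - loss(0, gamma)

def expected_loss(loss: Loss, p: float, gamma: float) -> float:
    """Equation (1): p * loss(1, gamma) + (1 - p) * loss(0, gamma)."""
    return p * loss(1, gamma) + (1 - p) * loss(0, gamma)

def best_decisions(loss: Loss, p: float, grid: List[float],
                   tol: float = 1e-12) -> List[float]:
    """Grid approximation (an assumption of this sketch) to the set of
    decisions minimizing the expected loss (1) for a given p."""
    values = [expected_loss(loss, p, g) for g in grid]
    best = min(values)
    return [g for g, v in zip(grid, values) if v <= best + tol]

# Decisions 0, 0.01, ..., 1.
grid = [i / 100 for i in range(101)]
```

For the square loss the grid minimizer at probability $p$ is $p$ itself whenever $p$ lies on the grid, which is the "honesty" property discussed for this game in the next section.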
We only consider games $(\Gamma,\lambda)$ such that

$$C_0 := \inf_{\gamma\in\Gamma} \lambda(0,\gamma), \quad C_1 := \inf_{\gamma\in\Gamma} \lambda(1,\gamma) \tag{2}$$

are finite. It is convenient (see, e.g., [§]) to summarize a game by its
superdecision set

$$\Sigma := \{(x,y) \in \mathbb{R}^2 \mid \exists \gamma \in \Gamma : x \ge \lambda(0,\gamma) \text{ and } y \ge \lambda(1,\gamma)\}; \tag{3}$$

elements of this set will be called superdecisions. Superdecisions of the form
$(\lambda(0,\gamma), \lambda(1,\gamma))$ will sometimes be called decisions. We will assume,
additionally, that the set $\Sigma \subseteq \mathbb{R}^2$ is convex and closed. The Eastern tail of the game is
the function

$$f : [C_0,\infty) \to \mathbb{R} \cup \{\infty\}, \quad x \mapsto \inf\{y \mid (x,y) \in \Sigma\} - C_1, \tag{4}$$

and its Northern tail is

$$g : [C_1,\infty) \to \mathbb{R} \cup \{\infty\}, \quad y \mapsto \inf\{x \mid (x,y) \in \Sigma\} - C_0, \tag{5}$$

where, as usual, $\inf \emptyset := \infty$; it is clear that $f$ and $g$ are nonnegative everywhere
and finite on $(C_0,\infty)$ and $(C_1,\infty)$, respectively.

The theorem 

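A reproducing kernel Hilbert space (RKHS) on $X$ is a Hilbert space $\mathcal{F}$ of
real-valued functions on $X$ such that the evaluation functional $f \in \mathcal{F} \mapsto f(x)$ is
continuous for each $x \in X$. By the Riesz-Fischer theorem, for each $x \in X$ there
exists a function $\mathbf{k}_x \in \mathcal{F}$ such that

$$f(x) = \langle \mathbf{k}_x, f \rangle_{\mathcal{F}}, \quad \forall f \in \mathcal{F}.$$

Let

$$c_{\mathcal{F}} := \sup_{x\in X} \|\mathbf{k}_x\|_{\mathcal{F}}; \tag{6}$$

we will be interested in the case $c_{\mathcal{F}} < \infty$. With each game $(\Gamma,\lambda)$ and each RKHS
$\mathcal{F}$ we associate the non-negative (but maybe infinite) constant $c_{\lambda,\mathcal{F}}$ defined by

$$c_{\lambda,\mathcal{F}} := \sup_{p\in(0,1)} \sup_{\gamma\in\Gamma_p} \sup_{x\in X} \sqrt{p(1-p)\left(\mathrm{Exp}_\lambda^2(\gamma) + \|\mathbf{k}_x\|_{\mathcal{F}}^2\right)}
= \sup_{p\in(0,1)} \sup_{\gamma\in\Gamma_p} \sqrt{p(1-p)\left(\mathrm{Exp}_\lambda^2(\gamma) + c_{\mathcal{F}}^2\right)}, \tag{7}$$

where $\Gamma_p := \arg\min_{\gamma\in\Gamma} \lambda(p,\gamma)$ (and $\lambda(p,\gamma)$ is defined by (1)).
The following is our main result.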
Theorem 1 Let the game $(\Gamma,\lambda)$ be such that (2) are finite, the superdecision
set $\Sigma$ is convex and closed, and the tails $f$ and $g$ satisfy

$$f'_+(t) = O(t^{-2}), \quad g'_+(t) = O(t^{-2}) \tag{8}$$

as $t \to \infty$, where $f'_+$ and $g'_+$ stand for the right derivatives (see, e.g., [15])
of $f$ and $g$. Let $\mathcal{F}$ be an RKHS on $X$ and $c_{\mathcal{F}}$, $c_{\lambda,\mathcal{F}}$ be defined by (6) and
(7). Suppose $c_{\mathcal{F}} < \infty$. Then $c_{\lambda,\mathcal{F}} < \infty$ and there is a decision strategy which
guarantees that

$$\sum_{n=1}^N \lambda(y_n,\gamma_n) \le \sum_{n=1}^N \lambda(y_n,D(x_n)) + c_{\lambda,\mathcal{F}} \left(\|\mathrm{Exp}_{\lambda,D}\|_{\mathcal{F}} + 1\right) \sqrt{N} \tag{9}$$

for all $N = 1, 2, \ldots$ and all $D : X \to \Gamma$ with $\mathrm{Exp}_{\lambda,D} \in \mathcal{F}$.

Remark 2 If the loss function $\lambda$ is bounded, (8) holds trivially. The right
derivatives in (8) can be replaced by the corresponding left derivatives, since
$|f'_+| \le |f'_-|$ and $|g'_+| \le |g'_-|$ (see, e.g., [15], Theorem 24.1). Condition (8)
can be interpreted as saying that the tails should shrink fast enough. The case
$f(t) = g(t) = t^{-1}$ can be considered borderline; Theorem 1 is still applicable in
this case, but it ceases to be applicable for tails that shrink less fast.

3 Examples 

In this section we first define a specific RKHS and then describe three important 
games. 

Kernels as source of RKHS 

We start by describing an equivalent language for talking about RKHS. The
kernel of an RKHS $\mathcal{F}$ on $X$ is

$$\mathbf{k}(x,x') := \langle \mathbf{k}_x, \mathbf{k}_{x'} \rangle_{\mathcal{F}}$$

(equivalently, we could define $\mathbf{k}(x,x')$ as $\mathbf{k}_x(x')$ or as $\mathbf{k}_{x'}(x)$). There is a
simple internal characterization of the kernels $\mathbf{k}$ of RKHS.

It is easy to check that the function $\mathbf{k}(x,x')$, as we defined it, is
symmetric ($\mathbf{k}(x,x') = \mathbf{k}(x',x)$ for all $x,x' \in X$) and positive definite
($\sum_{i=1}^m \sum_{j=1}^m a_i a_j \mathbf{k}(x_i,x_j) \ge 0$ for all $m = 1,2,\ldots$, all $(a_1,\ldots,a_m) \in \mathbb{R}^m$,
and all $(x_1,\ldots,x_m) \in X^m$). On the other hand, for every symmetric and
positive definite $\mathbf{k} : X^2 \to \mathbb{R}$ there exists a unique RKHS $\mathcal{F}$ such that $\mathbf{k}$ is the
kernel of $\mathcal{F}$ (pp, Théorème 2).

We can see that the notions of a kernel of RKHS and of a symmetric positive
definite function on $X^2$ have the same content, and we will sometimes say "kernel
on $X$" to mean a symmetric positive definite function on $X^2$. Kernels in this
sense are the main source of RKHS in learning theory; see, e.g., [23 1 and 
[18] for numerous examples. Every kernel on $X$ is a valid parameter for our
decision algorithm; to apply Theorem 1 we can use the equivalent definition of $c_{\mathcal{F}}$,

$$c_{\mathcal{F}} := \sup_{x\in X} \sqrt{\mathbf{k}(x,x)}.$$

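A long list of RKHS together with their kernels is given in §7.4. For
concreteness, in this section we will use the Sobolev space $\mathcal{S}$ of absolutely
continuous functions $f$ on $\mathbb{R}$ with finite norm

$$\|f\|_{\mathcal{S}} := \sqrt{\int_{-\infty}^{\infty} f^2(x)\,dx + \int_{-\infty}^{\infty} (f'(x))^2\,dx}; \tag{10}$$

its kernel is

$$\mathbf{k}(x,x') = \frac{1}{2} \exp(-|x - x'|)$$

(see [20] or 0, §7.4, Example 24). From the last equation we can see that
$c_{\mathcal{S}} = 1/\sqrt{2}$.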
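
The two properties used here, positive definiteness of the kernel and the value of $\sup_x \sqrt{\mathbf{k}(x,x)}$, are easy to probe numerically. The sketch below is only a sanity check under stated assumptions: the function names are hypothetical, positive definiteness is tested on randomly drawn points and coefficients rather than proved, and the supremum is evaluated at three sample points (which suffices here only because $\mathbf{k}(x,x)$ is constant).

```python
import math
import random

def sobolev_kernel(x: float, x2: float) -> float:
    """Kernel (1/2) * exp(-|x - x'|) of the Sobolev space with norm (10)."""
    return 0.5 * math.exp(-abs(x - x2))

def quadratic_form(points, coeffs) -> float:
    """sum_i sum_j a_i a_j k(x_i, x_j); nonnegative for a positive definite k."""
    return sum(a_i * a_j * sobolev_kernel(x_i, x_j)
               for x_i, a_i in zip(points, coeffs)
               for x_j, a_j in zip(points, coeffs))

rng = random.Random(0)  # fixed seed so that the check is reproducible
forms = []
for _ in range(200):
    m = rng.randint(1, 8)
    points = [rng.uniform(-5.0, 5.0) for _ in range(m)]
    coeffs = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    forms.append(quadratic_form(points, coeffs))

# k(x, x) = 1/2 for every x, so the supremum of sqrt(k(x, x)) is 1/sqrt(2).
c_S = max(math.sqrt(sobolev_kernel(x, x)) for x in (-3.0, 0.0, 2.5))
```

All sampled quadratic forms come out nonnegative (up to rounding), and `c_S` equals $1/\sqrt{2}$, the value used for the constants in the corollaries of this section.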

The square loss game

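For the square loss game, $\Gamma = [0,1]$ and $\lambda(y,\gamma) = (y - \gamma)^2$, and so we have

$$\mathrm{Exp}_\lambda(\gamma) = \lambda(1,\gamma) - \lambda(0,\gamma) = (1-\gamma)^2 - \gamma^2 = 1 - 2\gamma, \tag{11}$$

$$\lambda(p,\gamma) = p(1-\gamma)^2 + (1-p)\gamma^2 = p(1-p) + (\gamma - p)^2,$$

and

$$\Gamma_p = \{p\}. \tag{12}$$

Therefore,

$$c_{\lambda,\mathcal{F}} = \begin{cases} c_{\mathcal{F}}/2 & \text{if } c_{\mathcal{F}} \ge 1 \\ (1 + c_{\mathcal{F}}^2)/4 & \text{if } c_{\mathcal{F}} < 1; \end{cases}$$

in particular, $c_{\lambda,\mathcal{S}} = 3/8$ for the Sobolev space (10), and Theorem 1 implies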

Corollary 1 Suppose the data space is $X = \mathbb{R}$. There is a decision strategy
that guarantees that, for all $N$ and all decision rules $D \in \mathcal{S}$,

$$\sum_{n=1}^N (y_n - \gamma_n)^2 \le \sum_{n=1}^N (y_n - D(x_n))^2 + \frac{3}{8} \left(\|2D - 1\|_{\mathcal{S}} + 1\right) \sqrt{N}$$

($2D - 1$ is the decision rule "normalized" to take values in $[-1, 1]$).
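
The constant $3/8$ can be cross-checked numerically. The sketch below assumes that $c_{\lambda,\mathcal{F}}$ is the square root of $\sup_p p(1-p)(\mathrm{Exp}_\lambda^2(\gamma) + c_{\mathcal{F}}^2)$ with $\gamma = p$ (the only element of $\Gamma_p$ in this game), which is the reading under which the closed form given above is recovered; the function names and the grid resolution are choices of this illustration.

```python
import math

def c_lambda_square(c_F: float, steps: int = 100000) -> float:
    """Grid approximation to c_{lambda,F} for the square loss game."""
    best = 0.0
    for i in range(1, steps):
        p = i / steps
        exposure = 1.0 - 2.0 * p  # Exp(gamma) at gamma = p, see (11) and (12)
        best = max(best, p * (1.0 - p) * (exposure ** 2 + c_F ** 2))
    return math.sqrt(best)

def closed_form(c_F: float) -> float:
    """Closed form stated in the text for the square loss game."""
    return c_F / 2 if c_F >= 1 else (1 + c_F ** 2) / 4

c_sobolev = c_lambda_square(math.sqrt(0.5))  # c_F = 1/sqrt(2) for the Sobolev space
```

With $c_{\mathcal{F}} = 1/\sqrt{2}$ the supremum is attained at $p = 1/4$ and $p = 3/4$, giving $3/8$; for $c_{\mathcal{F}} \ge 1$ it is attained at $p = 1/2$.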

Remark 3 The games of this section illustrate Remark 1: here the decisions
$\gamma_n$ are best interpreted as predictions of $y_n$. Loss functions $\lambda$ satisfying (12)
are called proper scoring rules. Such loss functions "encourage honesty": it is
optimal to predict with the true probability (provided it is known). We will
later see another loss function of this type (the log loss function).
To illustrate Corollary 1 suppose there are constants $c > 1$ and $d > 1$ and
a good absolutely continuous decision rule $D : \mathbb{R} \to [0,1]$ such that $|x_n| < c$,
$n = 1,2,\ldots$, and $|D'(x)| < d$ for all $x \in X$. At rounds $N \gg cd^2$ the average loss
of our decision algorithm will be almost as good as (or better than) the loss of
$D$. We refrain from giving similar illustrations for the other corollaries in this
section.

The absolute loss game 

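In this game, $\lambda(y,\gamma) = |y - \gamma|$ with $\Gamma = [0,1]$. We find:

$$\mathrm{Exp}_\lambda(\gamma) = \lambda(1,\gamma) - \lambda(0,\gamma) = (1-\gamma) - \gamma = 1 - 2\gamma$$

(the same as in the square loss case, (11)),

$$\lambda(p,\gamma) = p(1-\gamma) + (1-p)\gamma = p + (1-2p)\gamma,$$

and

$$\Gamma_p = \begin{cases} \{0\} & \text{if } p < 1/2 \\ \{1\} & \text{if } p > 1/2 \\ [0,1] & \text{if } p = 1/2. \end{cases}$$

Therefore,

$$c_{\lambda,\mathcal{F}} = \frac{\sqrt{1 + c_{\mathcal{F}}^2}}{2}$$

(in particular, $c_{\lambda,\mathcal{S}} = \sqrt{6}/4$), and we have the following corollary of Theorem 1.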

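Corollary 2 Let $X = \mathbb{R}$. There is a decision strategy that produces decisions
$\gamma_n$ such that, for all $N$ and all $D \in \mathcal{S}$,

$$\sum_{n=1}^N |y_n - \gamma_n| \le \sum_{n=1}^N |y_n - D(x_n)| + \frac{\sqrt{6}}{4} \left(\|2D - 1\|_{\mathcal{S}} + 1\right) \sqrt{N}. \tag{13}$$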

The log loss game 

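For the log loss game, $\Gamma = (0,1)$ and

$$\lambda(y,\gamma) = -y \ln \gamma - (1-y) \ln(1-\gamma).$$

For this game, $c_{\lambda,\mathcal{F}} < \infty$ (assuming $c_{\mathcal{F}} < \infty$) since its tails satisfy

$$f'(t) = g'(t) = -\frac{1}{e^t - 1} \sim -e^{-t} = O(t^{-2});$$

this will be also clear from the following direct calculation. Since

$$\mathrm{Exp}_\lambda(\gamma) = \lambda(1,\gamma) - \lambda(0,\gamma) = -\ln \gamma + \ln(1-\gamma) = \ln \frac{1-\gamma}{\gamma},$$

$$\lambda(p,\gamma) = -p \ln \gamma - (1-p) \ln(1-\gamma) = \lambda(p,p) + D(p,\gamma)$$

(where $D(p,\gamma) := p \ln \frac{p}{\gamma} + (1-p) \ln \frac{1-p}{1-\gamma}$ is the Kullback distance between $p$ and
$\gamma$, known to take its minimal value in $\gamma$ at $\gamma = p$), and $\Gamma_p = \{p\}$, we can bound
$c_{\lambda,\mathcal{F}}$ from above as follows:

$$c_{\lambda,\mathcal{F}}^2 = \sup_{p\in(0,1)} p(1-p) \left(\ln^2 \frac{1-p}{p} + c_{\mathcal{F}}^2\right)
\le \frac{c_{\mathcal{F}}^2}{4} + \sup_{p\in(0,1)} p(1-p) \ln^2 \frac{1-p}{p}
\approx \frac{c_{\mathcal{F}}^2}{4} + 0.439 < \frac{c_{\mathcal{F}}^2}{4} + 0.44.$$

Of course, for specific values of $c_{\mathcal{F}}$ it is better to find the $\sup_{p\in(0,1)}$ directly,
without using this bound. Such a direct calculation shows that $c_{\lambda,\mathcal{S}} \approx 0.693 <
0.7$, and Theorem 1 now implies the following.

Corollary 3 Some decision strategy in the log loss game with $X = \mathbb{R}$ produces
decisions $\gamma_n$ such that, for all $N$ and all $D : X \to (0,1)$ with the log-likelihood
ratio $\ln \frac{D}{1-D}$ in $\mathcal{S}$,

$$\sum_{n=1}^N \lambda(y_n,\gamma_n) \le \sum_{n=1}^N \lambda(y_n,D(x_n)) + 0.7 \left(\left\|\ln \frac{D}{1-D}\right\|_{\mathcal{S}} + 1\right) \sqrt{N}.$$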
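
The two numerical constants for the log loss game, $0.439$ and $0.693$, can be reproduced by a direct grid search. This is a rough sketch under the assumption that $c_{\lambda,\mathcal{F}}^2 = \sup_p p(1-p)(\ln^2\frac{1-p}{p} + c_{\mathcal{F}}^2)$ with $c_{\mathcal{F}}^2 = 1/2$ for the Sobolev space; the helper names and the grid resolution are hypothetical choices.

```python
import math

def sup_over_p(fn, steps: int = 100000) -> float:
    """Maximum of fn over the grid p = 1/steps, ..., (steps - 1)/steps."""
    return max(fn(i / steps) for i in range(1, steps))

def log_term(p: float) -> float:
    """p(1 - p) ln^2((1 - p)/p): the part of the bound not involving c_F."""
    return p * (1.0 - p) * math.log((1.0 - p) / p) ** 2

sup_log = sup_over_p(log_term)

c_F_sq = 0.5  # squared constant c_F for the Sobolev kernel (1/2) exp(-|x - x'|)
c_log = math.sqrt(sup_over_p(lambda p: log_term(p) + p * (1.0 - p) * c_F_sq))
```

The grid search gives `sup_log` close to $0.439$ and `c_log` close to $0.693$, and the direct value `c_log ** 2` is indeed smaller than the bound `c_F_sq / 4 + sup_log`, since the two suprema are attained at different $p$.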



4 Idea of the proof of Theorem 1

This section describes the intuition behind the proof. The following sections, 
which carry out the proof, are formally independent of this section. We will also 
describe a general research program that may lead, it can be hoped, to many 
other results. 



Game-theoretic probability 

Our proof technique is based on a game-theoretic alternative to the standard 
measure-theoretic axioms of probability (53)- Many of the standard laws of 
probability, including the weak and strong laws of large numbers, the central 
limit theorem, and the law of the iterated logarithm, can be restated in terms 
of perfect information games involving three key players: Reality, Forecaster, 
and Skeptic. A typical game-theoretic law of probability states that Skeptic has 
a strategy which, without risking bankruptcy, greatly enriches him if the law is 
violated. All such strategies for Skeptic were explicitly constructed continuous 
functions; game-theoretic laws of probability with a continuous strategy for 
Skeptic will be called "continuous laws of probability" . 

Game-theoretic probability as developed in [17] was to a large degree
parallel to measure-theoretic probability. Following [7] and the literature that this
paper spawned, paper [27] pointed out a surprising feature of game-theoretic
probability: for any continuous law of probability, Forecaster has a strategy
that prevents Skeptic's capital from growing (cf. Lemma 1 below). In other
words, for any continuous law of probability there is a forecasting strategy that
is perfect as far as this law is concerned (we will say "perfect relative to" this
law). This result was obtained in [27] for binary forecasting, and in [28] it was
extended to more general protocols. Forecasting strategies obtained in this way
from various laws of probability were called "defensive forecasting" strategies.

General procedure 

Now we are ready to describe a general procedure whose implementation leads, 
in the most straightforward case, to Theorem 1.

Choose a goal which could be achieved if you knew the true probabilities 
generating the observations. It is important that this goal should be "prac- 
tical", in the sense of being stated in terms of observable quantities, such as 
data, decisions, and observations. The goal is not allowed to contain theoretical 
quantities, such as the true probabilities themselves, and it should be achievable 
no matter what the true probabilities are. Construct a decision strategy which, 
using the true probabilities, leads to the goal. 

Realistically, however, we do not know the true probabilities. To get rid of 
them, isolate the law of probability on which the proof that your decision strat- 
egy achieves the goal depends; typically, this law can be stated as a continuous 
game-theoretic law of probability. (If the proof depends on several laws, they 
should first be merged into a single law.) There is a forecasting strategy whose 
forecasts are at least as good as (and often better than) the true probabilities, 
as far as the law you have just isolated is concerned. It remains to feed your 
decision strategy with those forecasts. 

Implementing this procedure for various interesting goals appears to be a 
promising research program. 

Introduction to the proof 

In this paper our goal is to achieve (9), which we roughly rewrite as

$$\sum_{n=1}^N \lambda(y_n,\gamma_n) \lesssim \sum_{n=1}^N \lambda(y_n,D(x_n)),$$

where the informal notation $\lesssim$ is used to mean that the left-hand side does not
exceed the right-hand side plus a quantity small as compared to $N$. The goal
is stated in terms of the observables.

Let us see how our goal could be achieved if we knew the true probabilities
$p_n$ that $y_n = 1$ (slightly more formally, $p_n$ is the conditional probability that
$y_n = 1$ given the available information). By the law of large numbers (see,
e.g., ^Hjj Theorem VII.5.4, for a suitable measure-theoretic statement and [17],
Theorem 4.1, for its game-theoretic counterpart), we expect

$$\left|\sum_{n=1}^N f(p_n,x_n)(y_n - p_n)\right| \ll N \tag{14}$$

if $f$ is a bounded function (assumed measurable in the measure-theoretic case).
If $f$ is allowed to range over a function class $\mathcal{F}$ that is not excessively wide, (14)
will still continue to hold uniformly in $f$.

Suppose, for simplicity, that $\Gamma_p$ is a singleton for all $p \in [0,1]$; the only
element of $\Gamma_p$ will be denoted $G(p)$. Our decision strategy will make the decision
$G(p_n)$ at round $n$, i.e., the decision that leads to the smallest expected loss. We
will sometimes say that $G$ is our "choice function".

Notice that

$$\lambda(y,\gamma) - \lambda(p,\gamma) = (y - p)(\lambda(1,\gamma) - \lambda(0,\gamma))$$

always holds (this can be checked by subtracting (1) from $\lambda(y,\gamma) = y\lambda(1,\gamma) +
(1-y)\lambda(0,\gamma)$). In conjunction with the law of large numbers (14) this implies

$$\begin{aligned}
\sum_{n=1}^N \lambda(y_n,\gamma_n) &= \sum_{n=1}^N \lambda(y_n,G(p_n)) \\
&= \sum_{n=1}^N \lambda(p_n,G(p_n)) + \sum_{n=1}^N \left(\lambda(y_n,G(p_n)) - \lambda(p_n,G(p_n))\right) \\
&= \sum_{n=1}^N \lambda(p_n,G(p_n)) + \sum_{n=1}^N (y_n - p_n)\left(\lambda(1,G(p_n)) - \lambda(0,G(p_n))\right) \lesssim \sum_{n=1}^N \lambda(p_n,G(p_n)) \\
&\le \sum_{n=1}^N \lambda(p_n,D(x_n)) = \sum_{n=1}^N \lambda(y_n,D(x_n)) - \sum_{n=1}^N \left(\lambda(y_n,D(x_n)) - \lambda(p_n,D(x_n))\right) \\
&= \sum_{n=1}^N \lambda(y_n,D(x_n)) - \sum_{n=1}^N (y_n - p_n)\left(\lambda(1,D(x_n)) - \lambda(0,D(x_n))\right) \\
&\lesssim \sum_{n=1}^N \lambda(y_n,D(x_n)).
\end{aligned} \tag{15}$$



This shows that we can achieve our goal if we know the true probabilities, and 
it remains to replace the true probabilities with the forecasts that are perfect 
relative to the law of large numbers. 

For clarity, let us summarize the idea of the proof expressed by (15). To
show that the actual loss of our decision strategy does not exceed the actual
loss of a decision rule $D$ by much, we notice that:

• the actual loss $\sum_{n=1}^N \lambda(y_n,G(p_n))$ of our decision strategy is approximately
equal, by the law of large numbers, to the (one-step-ahead conditional)
expected loss $\sum_{n=1}^N \lambda(p_n,G(p_n))$ of our strategy;

• since we used the expected loss minimization principle, the expected loss
of our strategy does not exceed the expected loss of $D$;

• the expected loss $\sum_{n=1}^N \lambda(p_n,D(x_n))$ of $D$ is approximately equal to its
actual loss $\sum_{n=1}^N \lambda(y_n,D(x_n))$ (by the law of large numbers).
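
The three steps can be traced on a toy example. The sketch below uses the square loss, for which $G(p) = p$, and made-up sequences: the probabilities `ps`, the observations `ys` and the decisions `ds` of a hypothetical rule $D$ are assumptions of the illustration, not data from the paper. It checks the exact parts of the argument: the gap between actual and expected loss equals $\sum_n (y_n - p_n)\mathrm{Exp}_\lambda(\cdot)$, and expected loss minimization gives the middle inequality.

```python
def square_loss(y: float, gamma: float) -> float:
    return (y - gamma) ** 2

def expected_loss(p: float, gamma: float) -> float:
    """Expected square loss when the probability of 1 is p."""
    return p * square_loss(1, gamma) + (1 - p) * square_loss(0, gamma)

def exposure(gamma: float) -> float:
    """Exposure of the square loss: 1 - 2 * gamma."""
    return square_loss(1, gamma) - square_loss(0, gamma)

ps = [0.2, 0.7, 0.5, 0.9, 0.1]   # hypothetical probabilities p_n
ys = [0, 1, 1, 1, 0]             # hypothetical observations y_n
ds = [0.4, 0.6, 0.3, 0.8, 0.5]   # decisions D(x_n) of a hypothetical rule D

# Our strategy plays G(p_n) = p_n, the minimizer of the expected square loss.
actual_ours = sum(square_loss(y, p) for y, p in zip(ys, ps))
expected_ours = sum(expected_loss(p, p) for p in ps)
expected_rule = sum(expected_loss(p, d) for p, d in zip(ps, ds))
actual_rule = sum(square_loss(y, d) for y, d in zip(ys, ds))

# The "law of large numbers" terms that must be small for (15) to go through.
gap_ours = sum((y - p) * exposure(p) for y, p in zip(ys, ps))
gap_rule = sum((y - p) * exposure(d) for y, p, d in zip(ys, ps, ds))
```

Here `actual_ours - expected_ours` equals `gap_ours` and `actual_rule - expected_rule` equals `gap_rule` exactly, so the whole argument reduces to controlling these two sums, one depending on the forecasts only and one on the data only.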

To get the strongest possible result, we will have to use more specific laws of
probability than the general law of large numbers. It will be convenient to use
the following informal terminology introduced in [25]. Let $p_n$ be the forecasts
output by some forecasting strategy (rather than the true probabilities). We
say that the forecasting strategy has good calibration-cum-resolution if the
left-hand side of (14) is much less than $N$ for a relatively wide class of functions
$f : [0,1] \times X \to \mathbb{R}$ and large $N$. We say that the strategy has good calibration
if

$$\left|\sum_{n=1}^N (y_n - p_n) f(p_n)\right| \ll N$$

for a wide class of functions $f : [0,1] \to \mathbb{R}$ and large $N$. Finally, we say that the
strategy has good resolution if

$$\left|\sum_{n=1}^N (y_n - p_n) f(x_n)\right| \ll N$$

for a wide class of $f : X \to \mathbb{R}$ and for large $N$. For a detailed discussion and
examples, see [25].

Notice that in applying the law of large numbers to establishing the two
approximate inequalities in (15) we need not general $f = f(p,x)$ but only $f =
f(p)$ (known in advance) and $f = f(x)$. In particular, we only need calibration
and resolution separately, not calibration-cum-resolution. These are the two
specific probability laws we will be concerned with.

The requirement that $\Gamma_p$ should always be a singleton (in fact, we will even
need the function $G(p)$ to be continuous) is restrictive: for example, it is not
satisfied for the absolute loss function. To deal with this problem, we will have to
consider forecasting strategies that output extended forecasts $(p_n,q_n) \in [0,1]^2$,
where $p_n$ is the forecast of $y_n$ and the extra component $q_n$ will play a more
technical role.

The next section is devoted to constructing a perfect forecasting strategy
relative to the law of large numbers. In the following section we will be able to
prove Theorem 1.



5 The algorithm of large numbers 

This section is the core of our proof of Theorem 1. First we describe a forecasting
protocol in which Forecaster tries to predict the observations chosen by Reality.
Following [17], we introduce another player, Skeptic, who is allowed to bet at
the odds implied by Forecaster's moves.
Binary Forecasting Game I

Players: Reality, Forecaster, Skeptic

Protocol:

FOR n = 1, 2, ...:
  Reality announces $x_n \in X$.
  Forecaster announces $(p_n,q_n) \in [0,1]^2$.
  Skeptic announces $s_n \in \mathbb{R}$.
  Reality announces $y_n \in \{0,1\}$.
  $\mathcal{K}_n := \mathcal{K}_{n-1} + s_n(y_n - p_n)$.
END FOR.

The real forecast is $p_n$ (the "probability" that $y_n = 1$), which is interpreted
as the price Forecaster charges for a ticket paying $y_n$; $s_n$ is the number of
tickets Skeptic decides to buy. The protocol describes not only the players'
moves but also the changes in Skeptic's capital $\mathcal{K}_n$; its initial value $\mathcal{K}_0$ can be
an arbitrary real number. Skeptic demonstrates that the forecasts are poor if
he manages to multiply his initial capital (assumed positive) manyfold without
risking bankruptcy (i.e., $\mathcal{K}_n$ becoming negative). Forecaster also provides an
additional number $q_n \in [0,1]$ which does not affect Skeptic's capital; intuitively,
the role of $q_n$ is to help those of Forecaster's customers who find themselves in
a position of Buridan's ass (find two or more actions equally attractive in view
of the forecast $p_n$) to break the tie.

The main difference between our decision protocol (stated at the beginning
of §2) and the protocols of this section is that in the latter Forecaster implicitly
claims (by pricing the tickets) that he has the fullest possible knowledge of the
way Reality chooses the observations, and Skeptic tries to prove him wrong by
gambling against him. In the decision protocol, Decision Maker does not make
any such claims and simply tries to minimize his losses.

It will be convenient to make the set $[0,1]^2$ from which the forecasts $(p_n,q_n)$
are chosen into a topological space. The lexicographic square $\mathcal{L}$ is defined to be
the set $[0,1]^2$ equipped with the following linear order: if $(x_1,y_1)$ and $(x_2,y_2)$
are two points in $\mathcal{L}$, $(x_1,y_1) < (x_2,y_2)$ means that either $x_1 < x_2$ or $x_1 =
x_2$, $y_1 < y_2$. (Cf. [5], Problem 3.12.3(d).) The topology on the lexicographic
square is, as usual, generated by the open intervals

$$(a,b) := \{u \in \mathcal{L} \mid a < u < b\},$$

$a$ and $b$ ranging over $\mathcal{L}$. As a topological space, the lexicographic square is
normal ([5], Problem 1.7.4(d)), compact ([5], Problem 3.12.3(a), [10], Problem
5.C), and connected ([5], Problem 6.3.2(a), [10], Problem 1.1(d)).
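
The order on the lexicographic square is the familiar dictionary order on pairs. As a small illustration (the function names are hypothetical, and points are represented as Python tuples `(p, q)`), the order and membership in an open interval can be written as follows.

```python
def lex_less(a, b) -> bool:
    """Strict lexicographic order on pairs (p, q) in [0, 1]^2."""
    return a[0] < b[0] or (a[0] == b[0] and a[1] < b[1])

def in_open_interval(u, a, b) -> bool:
    """Membership of u in the open interval (a, b) of the lexicographic square."""
    return lex_less(a, u) and lex_less(u, b)
```

Python's built-in comparison of tuples implements the same order, so `lex_less(a, b)` agrees with `a < b` for pairs of floats. Note that the interval from `(0.5, 0.0)` to `(0.5, 1.0)` is a vertical segment: a neighbourhood of a point $(p,q)$ with $0 < q < 1$ need not contain any point with a different first coordinate, which is what makes this topology finer than the Euclidean one along the $q$ direction.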

As in , we will see that for any continuous strategy for Skeptic there 
exists a strategy for Forecaster that does not allow Skeptic's capital to grow, 
regardless of what Reality is doing. To state this observation in its strongest 
form, we make Skeptic announce his strategy for each round before Forecaster's 
move on that round rather than announce his full strategy at the beginning of 
the game. Therefore, we consider the following perfect-information game: 
Binary Forecasting Game II

Players: Reality, Forecaster, Skeptic

Protocol:

FOR n = 1, 2, ...:
  Reality announces $x_n \in X$.
  Skeptic announces continuous $S_n : \mathcal{L} \to \mathbb{R}$.
  Forecaster announces $(p_n,q_n) \in \mathcal{L}$.
  Reality announces $y_n \in \{0,1\}$.
  $\mathcal{K}_n := \mathcal{K}_{n-1} + S_n(p_n,q_n)(y_n - p_n)$.
END FOR.

Lemma 1 Forecaster has a strategy in Binary Forecasting Game II that ensures
$\mathcal{K}_0 \ge \mathcal{K}_1 \ge \mathcal{K}_2 \ge \cdots$.

Before proving this lemma, we will need another lemma, which will play the
role of the Intermediate Value Theorem, used in [25].

Lemma 2 If a continuous function $f : \mathcal{L} \to \mathbb{R}$ takes both positive and negative
values, there exists $x \in \mathcal{L}$ such that $f(x) = 0$.

Proof A continuous image of a connected compact set is connected ([5],
Theorem 6.1.4) and compact ([5], Theorem 3.1.10). Therefore, $f(\mathcal{L})$ is a closed
interval. ∎

Proof of Lemma 1 Forecaster can now use the following strategy to ensure
$\mathcal{K}_0 \ge \mathcal{K}_1 \ge \cdots$:

• if the function $S_n(p,q)$ takes value 0, choose $(p_n,q_n)$ such that $S_n(p_n,q_n) = 0$;

• if $S_n$ is always positive, take $p_n := 1$ and choose $q_n \in [0,1]$ arbitrarily;

• if $S_n$ is always negative, take $p_n := 0$ and choose $q_n \in [0,1]$ arbitrarily. ∎

A kernel $K$ on $\mathcal{L} \times X$ is forecast-continuous if the function $K((p,q,x),(p',q',x'))$
is continuous in $(p,q,p',q') \in \mathcal{L}^2$, for each fixed $(x,x') \in X^2$. (Kernels on $\mathcal{L} \times X$
are defined analogously to kernels on $X$.) For such a kernel the function

$$S_n(p,q) := \sum_{i=1}^{n-1} K((p,q,x_n),(p_i,q_i,x_i))(y_i - p_i) + \frac{1}{2} K((p,q,x_n),(p,q,x_n))(1 - 2p) \tag{16}$$

is continuous in $(p,q) \in \mathcal{L}$.

The lexicographic algorithm of large numbers ($\mathcal{L}$ALN)

Parameter: forecast-continuous kernel $K$ on $\mathcal{L} \times X$

FOR n = 1, 2, ...:
  Read $x_n \in X$.
  Define $S_n(p,q)$ by (16), $(p,q) \in \mathcal{L}$.
  Output any root $(p,q)$ of $S_n(p,q) = 0$ as $(p_n,q_n)$;
    if there are no roots,
    set $p_n := (1 + \operatorname{sign} S_n)/2$ and set $q_n$ to any number in $[0,1]$.
  Read $y_n \in \{0,1\}$.
END FOR.

(Notice that $\operatorname{sign} S_n$ is well defined by Lemma 2.) It is well known that there
exists a function $\Phi : \mathcal{L} \times X \to \mathcal{H}$ (a feature mapping taking values in a Hilbert
space $\mathcal{H}$) such that

$$K(a,b) = \Phi(a) \cdot \Phi(b), \quad \forall a,b \in \mathcal{L} \times X. \tag{17}$$

(For example, we can take the RKHS on $\mathcal{L} \times X$ with kernel $K$ as $\mathcal{H}$ and take
$a \mapsto K_a$ as the feature mapping; there are, however, easier and more
transparent constructions.) It can be shown that $\Phi(p,q,x)$ is forecast-continuous,
i.e., continuous in $(p,q) \in \mathcal{L}$ for each fixed $x \in X$, if and only if the kernel $K$
defined by (17) is forecast-continuous (see, e.g., [28], Appendix B).

Theorem 2 Let $K$ be the kernel defined by (17) for a forecast-continuous
feature mapping $\Phi : \mathcal{L} \times X \to \mathcal{H}$. The lexicographic algorithm of large numbers
with parameter $K$ outputs $(p_n,q_n)$ such that

$$\left\|\sum_{n=1}^N (y_n - p_n) \Phi(p_n,q_n,x_n)\right\|^2 \le \sum_{n=1}^N p_n(1 - p_n) \|\Phi(p_n,q_n,x_n)\|^2 \tag{18}$$

always holds for all $N = 1,2,\ldots$.

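Proof Following $\mathcal{L}$ALN Forecaster ensures that Skeptic will never increase his
capital with the strategy

$$s_n := \sum_{i=1}^{n-1} K((p_n,q_n,x_n),(p_i,q_i,x_i))(y_i - p_i) + \frac{1}{2} K((p_n,q_n,x_n),(p_n,q_n,x_n))(1 - 2p_n). \tag{19}$$

Using the formula

$$(y_n - p_n)^2 = p_n(1 - p_n) + (1 - 2p_n)(y_n - p_n)$$

(which can be checked by setting $y_n := 0$ and $y_n := 1$), we can see that the
increase in Skeptic's capital when he follows (19) is

$$\begin{aligned}
\mathcal{K}_N - \mathcal{K}_0 &= \sum_{n=1}^N s_n(y_n - p_n) \\
&= \sum_{n=1}^N \sum_{i=1}^{n-1} K((p_n,q_n,x_n),(p_i,q_i,x_i))(y_n - p_n)(y_i - p_i) \\
&\quad + \frac{1}{2} \sum_{n=1}^N K((p_n,q_n,x_n),(p_n,q_n,x_n))(1 - 2p_n)(y_n - p_n) \\
&= \frac{1}{2} \sum_{n=1}^N \sum_{i=1}^N K((p_n,q_n,x_n),(p_i,q_i,x_i))(y_n - p_n)(y_i - p_i) \\
&\quad - \frac{1}{2} \sum_{n=1}^N K((p_n,q_n,x_n),(p_n,q_n,x_n))(y_n - p_n)^2 \\
&\quad + \frac{1}{2} \sum_{n=1}^N K((p_n,q_n,x_n),(p_n,q_n,x_n))(1 - 2p_n)(y_n - p_n) \\
&= \frac{1}{2} \sum_{n=1}^N \sum_{i=1}^N K((p_n,q_n,x_n),(p_i,q_i,x_i))(y_n - p_n)(y_i - p_i) \\
&\quad - \frac{1}{2} \sum_{n=1}^N K((p_n,q_n,x_n),(p_n,q_n,x_n))\, p_n(1 - p_n) \\
&= \frac{1}{2} \left\|\sum_{n=1}^N (y_n - p_n) \Phi(p_n,q_n,x_n)\right\|^2 - \frac{1}{2} \sum_{n=1}^N p_n(1 - p_n) \|\Phi(p_n,q_n,x_n)\|^2,
\end{aligned}$$

which immediately implies (18). ∎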
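
The algorithm and the bound (18) can be explored numerically. The following is a simplified sketch, not the algorithm as stated: the tie-breaking component $q_n$ is dropped, the kernel is a hypothetical Gaussian kernel on pairs $(p,x)$, roots of $S_n$ are located by bisection (so they are only approximate), and the endpoint $p_n = 0$ or $p_n = 1$ is chosen whenever $S_n(0) \le 0$ or $S_n(1) \ge 0$, which also keeps Skeptic's gain $S_n(p_n)(y_n - p_n)$ nonpositive. The data `xs` and `ys` are made up.

```python
import math
import random

def kernel(a, b) -> float:
    """Hypothetical Gaussian kernel on pairs (p, x); continuous in p."""
    (p, x), (p2, x2) = a, b
    return math.exp(-((p - p2) ** 2 + (x - x2) ** 2))

def skeptic_bet(p, x_n, history) -> float:
    """S_n(p) as in (16), with the q component omitted."""
    a = (p, x_n)
    total = sum(kernel(a, (p_i, x_i)) * (y_i - p_i) for p_i, x_i, y_i in history)
    return total + 0.5 * kernel(a, a) * (1.0 - 2.0 * p)

def forecast(x_n, history, iters: int = 60) -> float:
    """Pick p_n so that S_n(p_n) * (y_n - p_n) <= 0 (up to rounding) for both y_n."""
    if skeptic_bet(0.0, x_n, history) <= 0.0:
        return 0.0  # S_n(0) * (y - 0) <= 0 for y in {0, 1}
    if skeptic_bet(1.0, x_n, history) >= 0.0:
        return 1.0  # S_n(1) * (y - 1) <= 0 for y in {0, 1}
    lo, hi = 0.0, 1.0  # invariant: S_n(lo) > 0 >= S_n(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if skeptic_bet(mid, x_n, history) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def run(xs, ys):
    """Play the forecasting game; return the history and Skeptic's total gain."""
    history, capital = [], 0.0
    for x_n, y_n in zip(xs, ys):
        p_n = forecast(x_n, history)
        capital += skeptic_bet(p_n, x_n, history) * (y_n - p_n)
        history.append((p_n, x_n, y_n))
    return history, capital

def bound_sides(history):
    """Left- and right-hand sides of (18), computed through the kernel."""
    pts = [(p, x) for p, x, _ in history]
    res = [y - p for p, _, y in history]
    lhs = sum(r_n * r_i * kernel(a_n, a_i)
              for a_n, r_n in zip(pts, res) for a_i, r_i in zip(pts, res))
    rhs = sum(p * (1.0 - p) * kernel((p, x), (p, x)) for p, x, _ in history)
    return lhs, rhs

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(100)]
ys = [1 if x > 0 else 0 for x in xs]  # made-up observations
history, capital = run(xs, ys)
lhs, rhs = bound_sides(history)
```

In agreement with the proof, the difference `lhs - rhs` equals twice Skeptic's gain `capital` up to rounding, so (18) holds as soon as the gain is nonpositive; with approximate roots the gain is at most a tiny positive number.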



Resolution 

This subsection makes the next step in our proof of Theorem 1. Its forecasting
protocol is:

FOR n = 1, 2, ...:
  Reality announces $x_n \in X$.
  Forecaster announces $(p_n,q_n) \in [0,1]^2$.
  Reality announces $y_n \in \{0,1\}$.
END FOR.

Our goal is to prove the following result (although in §6 we will need a slight
modification of this result rather than the result itself).
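Theorem 3 Let $\mathcal{F}$ be an RKHS on $X$. The forecasts $(p_n,q_n)$ output by $\mathcal{L}$ALN
always satisfy

$$\left|\sum_{n=1}^N (y_n - p_n) f(x_n)\right| \le c_{\mathcal{F}} \|f\|_{\mathcal{F}} \sqrt{\sum_{n=1}^N p_n(1 - p_n)}$$

for all $N$ and all functions $f \in \mathcal{F}$.

Proof Applying $\mathcal{L}$ALN to the feature mapping $x \in X \mapsto \mathbf{k}_x \in \mathcal{F}$ and using
(18), we obtain

$$\begin{aligned}
\left|\sum_{n=1}^N (y_n - p_n) f(x_n)\right|
&= \left|\left\langle \sum_{n=1}^N (y_n - p_n) \mathbf{k}_{x_n}, f \right\rangle_{\mathcal{F}}\right|
\le \left\|\sum_{n=1}^N (y_n - p_n) \mathbf{k}_{x_n}\right\|_{\mathcal{F}} \|f\|_{\mathcal{F}} \\
&\le \|f\|_{\mathcal{F}} \sqrt{\sum_{n=1}^N p_n(1 - p_n) \mathbf{k}(x_n,x_n)}
\le c_{\mathcal{F}} \|f\|_{\mathcal{F}} \sqrt{\sum_{n=1}^N p_n(1 - p_n)}
\end{aligned} \tag{20}$$

for any $f \in \mathcal{F}$. ∎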



Remark 4 In the terminology introduced in the previous section, Theorem 3 is
about resolution. This is sufficient for the purpose of this paper, but it is easy to
see that similar statements hold for calibration-cum-resolution and calibration.
For example, let $\mathcal{F}$ be an RKHS on $\mathcal{L} \times X$. The forecasts $(p_n,q_n)$ output by
$\mathcal{L}$ALN always satisfy

$$\left|\sum_{n=1}^N (y_n - p_n) f(p_n,q_n,x_n)\right| \le c_{\mathcal{F}} \|f\|_{\mathcal{F}} \sqrt{\sum_{n=1}^N p_n(1 - p_n)}$$

for all $N$ and all functions $f \in \mathcal{F}$.



6 Proof of Theorem 1

Before starting the proof proper, we need to discuss two topics: choosing a 
suitable choice function and "mixing" different feature mappings. 

The canonical choice function 

Let us say that a straight line $(1 - p)x + py = c$ in the $(x, y)$-plane, where 
$p \in [0, 1]$ and $c \in \mathbb{R}$, is southwest of the superdecision set $\Sigma$ (defined by (3)) if 

$$\forall (x, y) \in \Sigma : (1 - p)x + py \ge c.$$

For each $p \in [0, 1]$ let $c(p)$ be the largest $c$ (which obviously exists) such that 
the line $(1 - p)x + py = c$ is southwest of $\Sigma$. It is clear that, for $p \in (0, 1)$, the 
line $(1 - p)x + py = c(p)$ intersects $\Sigma$ and the intersection, being compact and 
convex, has the form $[A(p), B(p)]$, where $A(p)$ and $B(p)$ are points (perhaps 
$A(p) = B(p)$) on the line. For concreteness, let $A(p)$ be northwest of $B(p)$ 
(i.e., if $A(p) = (A_0, A_1)$ and $B(p) = (B_0, B_1)$, we assume that $A_0 \le B_0$ and 
$A_1 \ge B_1$). Now we can define the canonical choice function $G$ associated with 
$(\Gamma, \lambda)$ as follows: 

• if $0 < p < 1$ and $q \in [0, 1]$, $G(p, q)$ is defined to be any $\gamma \in \Gamma$ satisfying 

$$(\lambda(0, \gamma), \lambda(1, \gamma)) = (1 - q)A(p) + qB(p);$$

the existence of such a $\gamma$ is obvious; 

• if $p = 0$ and $q \in [0, 1]$, $G(p, q)$ is defined to be any fixed $\gamma_0 \in \Gamma$ satisfying 

$$(\lambda(0, \gamma_0), \lambda(1, \gamma_0)) = (C_0, f(C_0))$$

($C_0$ and $f$ are as defined earlier); if $f(C_0) = \infty$, such a $\gamma_0$ does not 
exist and $G(p, q)$ is undefined; 

• if $p = 1$ and $q \in [0, 1]$, $G(p, q)$ is defined to be any fixed $\gamma_1 \in \Gamma$ such that 

$$(\lambda(0, \gamma_1), \lambda(1, \gamma_1)) = (g(C_1), C_1)$$

($C_1$ and $g$ are as defined earlier); if $g(C_1) = \infty$, such a $\gamma_1$ does not 
exist and $G(p, q)$ is undefined. 

It is easy to see that the function $(\lambda(0, G(p, q)), \lambda(1, G(p, q)))$ is continuous in 
$(p, q) \in \operatorname{dom} G$ and, therefore, $\mathrm{Exp}_{\lambda,G}(p, q) := \mathrm{Exp}_{\lambda}(G(p, q))$ is continuous in 
$(p, q) \in \operatorname{dom} G$. We defined $G$ in such a way that it is a "perfect" choice 
function: $\lambda(p, G(p, q)) = \inf_{\gamma \in \Gamma} \lambda(p, \gamma)$ for virtually all $(p, q)$ (in any case, for 
all $(p, q) \in \operatorname{dom} G$). 
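The "perfect" property can be checked numerically for simple games. The sketch below does this on a grid of decisions for the square loss with $G(p, q) = p$ and for the absolute loss on $\Gamma = [0, 1]$. The absolute-loss choice function used here (0 for $p < 1/2$, $q$ for $p = 1/2$, 1 for $p > 1/2$) is derived from the definition above and should be read as an assumption, as should all function names.

```python
def expected_loss(loss, p, gamma):
    # lambda(p, gamma) := p * lambda(1, gamma) + (1 - p) * lambda(0, gamma)
    return p * loss(1, gamma) + (1 - p) * loss(0, gamma)

def square_loss(y, gamma):
    return (y - gamma) ** 2

def absolute_loss(y, gamma):
    return abs(y - gamma)

def G_square(p, q):
    # Choice function for the square loss: the forecast itself.
    return p

def G_absolute(p, q):
    # Assumed choice function for the absolute loss on Gamma = [0, 1]:
    # q only matters in the tie p = 1/2.
    if p < 0.5:
        return 0.0
    if p > 0.5:
        return 1.0
    return q

def is_perfect(loss, G, p, q, grid=1000, tol=1e-9):
    # Compare lambda(p, G(p, q)) with the minimum over a grid of decisions.
    best = min(expected_loss(loss, p, k / grid) for k in range(grid + 1))
    return expected_loss(loss, p, G(p, q)) <= best + tol
```

For example, `is_perfect(absolute_loss, G_absolute, 0.5, 0.3)` compares the expected loss of the decision 0.3 with the best grid decision at $p = 1/2$, where every decision has the same expected loss.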



Mixing 

In the proof of Theorem 1 we will mix the feature mapping $\Phi_0(p, q, x) := 
\mathrm{Exp}_{\lambda,G}(p, q)$ (into $\mathcal{H}_0 := \mathbb{R}$) and the feature mapping $\Phi_1(p, q, x) := \mathbf{K}_x$ used in 
the proof of Theorem 3 (as discussed in §2, we will have to achieve two goals 
simultaneously, only one of them connected with resolution). This can be done 
using the following corollary of Theorem 2. 

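Corollary 4 Let $\Phi_j : \mathfrak{L} \times X \to \mathcal{H}_j$, $j = 0, 1$, be forecast-continuous mappings 
from $\mathfrak{L} \times X$ to Hilbert spaces $\mathcal{H}_j$. The forecasts output by $\mathfrak{L}$ALN with a suitable 
kernel parameter always satisfy 

$$\left\| \sum_{n=1}^N (y_n - p_n) \Phi_j(p_n, q_n, x_n) \right\|_{\mathcal{H}_j}^2
\le \sum_{n=1}^N p_n (1 - p_n) \left( \|\Phi_0(p_n, q_n, x_n)\|_{\mathcal{H}_0}^2 + \|\Phi_1(p_n, q_n, x_n)\|_{\mathcal{H}_1}^2 \right)$$

for all $N$ and for both $j = 0$ and $j = 1$. 

Proof Define the direct sum $\mathcal{H}$ of $\mathcal{H}_0$ and $\mathcal{H}_1$ as the Cartesian product $\mathcal{H}_0 \times \mathcal{H}_1$ 
equipped with the inner product 

$$\langle g, g' \rangle_{\mathcal{H}} = \langle (g_0, g_1), (g'_0, g'_1) \rangle_{\mathcal{H}} := \sum_{j=0}^{1} \langle g_j, g'_j \rangle_{\mathcal{H}_j}.$$

Now we can define $\Phi : \mathfrak{L} \times X \to \mathcal{H}$ by 

$$\Phi(p, q, x) := (\Phi_0(p, q, x), \Phi_1(p, q, x));$$

the corresponding kernel is 

$$\begin{aligned}
K((p, q, x), (p', q', x')) &:= \langle \Phi(p, q, x), \Phi(p', q', x') \rangle_{\mathcal{H}} \\
&= \sum_{j=0}^{1} \langle \Phi_j(p, q, x), \Phi_j(p', q', x') \rangle_{\mathcal{H}_j} = \sum_{j=0}^{1} K_j((p, q, x), (p', q', x')),
\end{aligned}$$

where $K_0$ and $K_1$ are the kernels corresponding to $\Phi_0$ and $\Phi_1$, respectively. 
It is clear that this kernel is forecast-continuous. Applying $\mathfrak{L}$ALN to it and 
using (18), we obtain 

$$\begin{aligned}
\left\| \sum_{n=1}^N (y_n - p_n) \Phi_j(p_n, q_n, x_n) \right\|_{\mathcal{H}_j}^2
&\le \left\| \left( \sum_{n=1}^N (y_n - p_n) \Phi_0(p_n, q_n, x_n), \sum_{n=1}^N (y_n - p_n) \Phi_1(p_n, q_n, x_n) \right) \right\|_{\mathcal{H}}^2 \\
&= \left\| \sum_{n=1}^N (y_n - p_n) \Phi(p_n, q_n, x_n) \right\|_{\mathcal{H}}^2
\le \sum_{n=1}^N p_n (1 - p_n) \|\Phi(p_n, q_n, x_n)\|_{\mathcal{H}}^2 \\
&= \sum_{n=1}^N p_n (1 - p_n) \sum_{j=0}^{1} \|\Phi_j(p_n, q_n, x_n)\|_{\mathcal{H}_j}^2.
\end{aligned}$$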
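The direct-sum construction is easy to illustrate with finite-dimensional feature spaces, where the direct sum is simply concatenation of coordinate vectors. The sketch below is only an illustration: the feature maps `phi0` (into $\mathbb{R}$) and `phi1` (into $\mathbb{R}^3$) are invented toy examples, not the mappings $\Phi_0$ and $\Phi_1$ of the paper.

```python
def phi0(u):
    # Toy feature map into H_0 = R (hypothetical example).
    return (u,)

def phi1(u):
    # Toy feature map into H_1 = R^3 (hypothetical example).
    return (1.0, u, u * u)

def phi(u):
    # Direct sum: concatenating coordinates gives a map into H_0 x H_1.
    return phi0(u) + phi1(u)

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def weighted_sum_norm_sq(feature, points, weights):
    # || sum_n c_n feature(u_n) ||^2 in the corresponding Euclidean space.
    total = [0.0] * len(feature(points[0]))
    for u, c in zip(points, weights):
        for i, v in enumerate(feature(u)):
            total[i] += c * v
    return dot(total, total)
```

With these definitions the kernel of `phi` is the sum of the kernels of `phi0` and `phi1`, and the squared norm of a weighted sum under `phi` is the sum of the two component squared norms, so each component is bounded by the whole, which is the first step of the proof above.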



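Merging $\Phi_0$ and $\Phi_1$ by Corollary 4, we obtain 

$$\left| \sum_{n=1}^N (y_n - p_n) \mathrm{Exp}_{\lambda,G}(p_n, q_n) \right|
= \left| \sum_{n=1}^N (y_n - p_n) \Phi_0(p_n, q_n, x_n) \right|
\le \sqrt{\sum_{n=1}^N p_n (1 - p_n) \left( \mathrm{Exp}_{\lambda,G}^2(p_n, q_n) + \mathbf{K}(x_n, x_n) \right)} \tag{21}$$

and, using (20), 

$$\left| \sum_{n=1}^N (y_n - p_n) f(x_n) \right|
\le \|f\|_{\mathcal{F}} \left\| \sum_{n=1}^N (y_n - p_n) \Phi_1(p_n, q_n, x_n) \right\|_{\mathcal{F}}
\le \|f\|_{\mathcal{F}} \sqrt{\sum_{n=1}^N p_n (1 - p_n) \left( \mathrm{Exp}_{\lambda,G}^2(p_n, q_n) + \mathbf{K}(x_n, x_n) \right)} \tag{22}$$

for each function $f \in \mathcal{F}$. 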
Proof: Part I 

In this subsection we will assume that $\operatorname{dom} G = \mathfrak{L}$. Subtracting the definition 
$\lambda(p, \gamma) := p\lambda(1, \gamma) + (1 - p)\lambda(0, \gamma)$ from 
$\lambda(y, \gamma) = y\lambda(1, \gamma) + (1 - y)\lambda(0, \gamma)$, we obtain 

$$\lambda(y, \gamma) - \lambda(p, \gamma) = (y - p)(\lambda(1, \gamma) - \lambda(0, \gamma)) = (y - p)\,\mathrm{Exp}_{\lambda}(\gamma) \tag{23}$$

(we already did this in §2, but we promised that the rest of the paper would be 
formally independent of §2). Using the last equality and (21)-(22), we obtain 
for the decision strategy $\gamma_n := G(p_n, q_n)$ based on the $(p_n, q_n)$ output by $\mathfrak{L}$ALN 
with the merged kernel as parameter: 



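$$\begin{aligned}
\sum_{n=1}^N \lambda(y_n, \gamma_n)
&= \sum_{n=1}^N \lambda(y_n, G(p_n, q_n)) \\
&= \sum_{n=1}^N \lambda(p_n, G(p_n, q_n)) + \sum_{n=1}^N \left( \lambda(y_n, G(p_n, q_n)) - \lambda(p_n, G(p_n, q_n)) \right) \\
&= \sum_{n=1}^N \lambda(p_n, G(p_n, q_n)) + \sum_{n=1}^N (y_n - p_n)\,\mathrm{Exp}_{\lambda,G}(p_n, q_n) \\
&\le \sum_{n=1}^N \lambda(p_n, G(p_n, q_n)) + \sqrt{\sum_{n=1}^N p_n (1 - p_n) \left( \mathrm{Exp}_{\lambda,G}^2(p_n, q_n) + \mathbf{K}(x_n, x_n) \right)} \\
&\le \sum_{n=1}^N \lambda(p_n, D(x_n)) + \sqrt{\sum_{n=1}^N p_n (1 - p_n) \left( \mathrm{Exp}_{\lambda,G}^2(p_n, q_n) + \mathbf{K}(x_n, x_n) \right)} \\
&= \sum_{n=1}^N \lambda(y_n, D(x_n)) - \sum_{n=1}^N \left( \lambda(y_n, D(x_n)) - \lambda(p_n, D(x_n)) \right) + \sqrt{\sum_{n=1}^N p_n (1 - p_n) \left( \mathrm{Exp}_{\lambda,G}^2(p_n, q_n) + \mathbf{K}(x_n, x_n) \right)} \\
&= \sum_{n=1}^N \lambda(y_n, D(x_n)) - \sum_{n=1}^N (y_n - p_n)\,\mathrm{Exp}_{\lambda,D}(x_n) + \sqrt{\sum_{n=1}^N p_n (1 - p_n) \left( \mathrm{Exp}_{\lambda,G}^2(p_n, q_n) + \mathbf{K}(x_n, x_n) \right)} \\
&\le \sum_{n=1}^N \lambda(y_n, D(x_n)) + \left\| \mathrm{Exp}_{\lambda,D} \right\|_{\mathcal{F}} \sqrt{\sum_{n=1}^N p_n (1 - p_n) \left( \mathrm{Exp}_{\lambda,G}^2(p_n, q_n) + \mathbf{K}(x_n, x_n) \right)} \\
&\qquad + \sqrt{\sum_{n=1}^N p_n (1 - p_n) \left( \mathrm{Exp}_{\lambda,G}^2(p_n, q_n) + \mathbf{K}(x_n, x_n) \right)} \\
&= \sum_{n=1}^N \lambda(y_n, D(x_n)) + \left( \left\| \mathrm{Exp}_{\lambda,D} \right\|_{\mathcal{F}} + 1 \right) \sqrt{\sum_{n=1}^N p_n (1 - p_n) \left( \mathrm{Exp}_{\lambda,G}^2(p_n, q_n) + \mathbf{K}(x_n, x_n) \right)} \\
&\le \sum_{n=1}^N \lambda(y_n, D(x_n)) + \left( \left\| \mathrm{Exp}_{\lambda,D} \right\|_{\mathcal{F}} + 1 \right) \sqrt{\sum_{n=1}^N p_n (1 - p_n) \left( \mathrm{Exp}_{\lambda,G}^2(p_n, q_n) + c_{\mathcal{F}}^2 \right)},
\end{aligned}$$

where $\mathrm{Exp}_{\lambda,D}(x) := \mathrm{Exp}_{\lambda}(D(x))$. 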



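It remains to show that $c_{\lambda,\mathcal{F}} < \infty$ (assuming $c_{\mathcal{F}} < \infty$, here and in the rest 
of this section). In this case, $\operatorname{dom} G = \mathfrak{L}$, this is easy: essentially, this is the case 
of a bounded loss function (the reservation "essentially" is needed since $\Gamma$ can 
contain "litter", i.e., decisions dominated by other decisions in $\Gamma$). Since $\mathrm{Exp}_{\lambda,G}$ is 
continuous and $\mathfrak{L}$ is compact, 

$$\sup_{(p,q) \in \mathfrak{L}} p(1 - p) \left( \mathrm{Exp}_{\lambda,G}^2(p, q) + c_{\mathcal{F}}^2 \right) < \infty.$$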

Proof: Part II 

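The stripped lexicographic square is the subset 

$${}^{\dagger}\mathfrak{L}^{\dagger} := (0, 1) \times [0, 1]$$

of $\mathfrak{L}$. In this subsection we consider the case $\operatorname{dom} G = {}^{\dagger}\mathfrak{L}^{\dagger}$. 

The order and topology on ${}^{\dagger}\mathfrak{L}^{\dagger}$ are inherited from $\mathfrak{L}$. The following analogue 
of Lemma 2 still holds. 

Lemma 3 If a continuous function $f : {}^{\dagger}\mathfrak{L}^{\dagger} \to \mathbb{R}$ takes both positive and negative 
values, it also takes the value 0. 

Proof See the proof of Lemma 2: $f({}^{\dagger}\mathfrak{L}^{\dagger})$ is still a connected set in $\mathbb{R}$. 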

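A kernel $K$ on ${}^{\dagger}\mathfrak{L}^{\dagger} \times X$ is forecast-continuous if the function $K((p, q, x), (p', q', x'))$ 
is continuous in $(p, q, p', q') \in ({}^{\dagger}\mathfrak{L}^{\dagger})^2$. The function (16) is then continuous in 
$(p, q) \in {}^{\dagger}\mathfrak{L}^{\dagger}$; and for our current kernel 

$$K((p, q, x), (p', q', x')) = \mathrm{Exp}_{\lambda,G}(p, q)\,\mathrm{Exp}_{\lambda,G}(p', q') + \langle \mathbf{K}_x, \mathbf{K}_{x'} \rangle_{\mathcal{F}} \tag{24}$$

it equals 

$$\begin{aligned}
S_n(p, q) &= \sum_{i=1}^{n-1} \left( \mathrm{Exp}_{\lambda,G}(p, q)\,\mathrm{Exp}_{\lambda,G}(p_i, q_i) + \langle \mathbf{K}_{x_n}, \mathbf{K}_{x_i} \rangle_{\mathcal{F}} \right)(y_i - p_i) \\
&\quad + \frac{1}{2} \left( \mathrm{Exp}_{\lambda,G}^2(p, q) + \|\mathbf{K}_{x_n}\|_{\mathcal{F}}^2 \right)(1 - 2p) \\
&= A\,\mathrm{Exp}_{\lambda,G}(p, q) + B + \frac{1}{2}\,\mathrm{Exp}_{\lambda,G}^2(p, q)(1 - 2p) + Cp,
\end{aligned} \tag{25}$$

where $A$, $B$, and $C$ do not depend on $(p, q)$. Since $\operatorname{dom} G = {}^{\dagger}\mathfrak{L}^{\dagger}$, $|\mathrm{Exp}_{\lambda,G}(p, q)| \to 
\infty$ as $p \to 0$ or $p \to 1$, and so 

$$\lim_{\substack{(p,q) \to (1,0) \\ (p,q) \in {}^{\dagger}\mathfrak{L}^{\dagger}}} S_n(p, q) = -\infty \tag{26}$$

and 

$$\lim_{\substack{(p,q) \to (0,1) \\ (p,q) \in {}^{\dagger}\mathfrak{L}^{\dagger}}} S_n(p, q) = \infty. \tag{27}$$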

The stripped lexicographic ALN (or, briefly, ${}^{\dagger}\mathfrak{L}^{\dagger}$ALN) is defined as the lexico- 
graphic ALN except that: 

• its parameter is a forecast-continuous kernel $K$ on ${}^{\dagger}\mathfrak{L}^{\dagger} \times X$; 

• it outputs a root $(p, q)$ (an element of ${}^{\dagger}\mathfrak{L}^{\dagger} = \operatorname{dom} S_n$) of the equation 
$S_n(p, q) = 0$ as $(p_n, q_n)$ and crashes if this equation does not have roots 
(this will never happen for the kernel (24)). 

Because of (26) and (27), ${}^{\dagger}\mathfrak{L}^{\dagger}$ALN applied to the kernel (24) on ${}^{\dagger}\mathfrak{L}^{\dagger} \times X$ still 
ensures that (18) holds for our feature mapping $(\Phi_0, \Phi_1)$; this algorithm never 
crashes and, of course, never outputs $(p_n, q_n)$ with $p_n \in \{0, 1\}$. We can see that 
the proof of (10) given in the previous subsection still works. 

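Let us now prove that $c_{\lambda,\mathcal{F}} < \infty$ when $\operatorname{dom} G = {}^{\dagger}\mathfrak{L}^{\dagger}$. It suffices to check that 

$$\limsup_{\substack{(p,q) \to (0,1) \\ (p,q) \in {}^{\dagger}\mathfrak{L}^{\dagger}}} p\,\mathrm{Exp}_{\lambda,G}^2(p, q) < \infty \tag{28}$$

and 

$$\limsup_{\substack{(p,q) \to (1,0) \\ (p,q) \in {}^{\dagger}\mathfrak{L}^{\dagger}}} (1 - p)\,\mathrm{Exp}_{\lambda,G}^2(p, q) < \infty. \tag{29}$$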
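Conditions (28) and (29) can be illustrated on the log loss game, for which $G(p, q) = p$. The sketch below assumes the natural-logarithm form of the log loss, $\lambda(1, \gamma) = -\ln \gamma$ and $\lambda(0, \gamma) = -\ln(1 - \gamma)$, so that $\mathrm{Exp}_{\lambda,G}(p, q) = \ln((1 - p)/p)$ is unbounded near $p = 0$ and $p = 1$; this parametrization and the function names are assumptions made for illustration only.

```python
import math

def exp_log_loss(p):
    # Exp_{lambda,G}(p, q) for the (assumed) natural-log loss with G(p, q) = p:
    # lambda(1, p) - lambda(0, p) = -ln p + ln(1 - p).
    return math.log((1.0 - p) / p)

def p_times_exp_sq(p):
    # The quantity under the limsup in (28).
    return p * exp_log_loss(p) ** 2

def weighted_exp_sq(p):
    # p (1 - p) Exp^2, the part of the supremum defining c_{lambda,F}
    # that depends on the loss function.
    return p * (1.0 - p) * exp_log_loss(p) ** 2
```

Evaluating `p_times_exp_sq` at $p = 10^{-2}, 10^{-3}, \ldots$ gives a decreasing sequence, which is what (28) requires in this example even though $|\mathrm{Exp}_{\lambda,G}|$ itself tends to infinity.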

For example, let us demonstrate (29). Without loss of generality, we replace 
the original growth condition on $f$ and $g$ with 

$$f'_{-}(t) = O(t^{-2}), \quad g'_{-}(t) = O(t^{-2}) \tag{30}$$

(this can be done since $f'_{-}(t) \le f'_{+}(t) \le f'_{-}(t+1)$ and $g'_{-}(t) \le g'_{+}(t) \le g'_{-}(t+1)$). 
Consider the decision 

$$(X, Y) := (\lambda(0, G(p, q)), \lambda(1, G(p, q))).$$

Since $-\frac{1-p}{p}$ is a subgradient (see, e.g., [13], Section 23) of $f(x)$ at $X$, (30) implies 
that $1 - p = O(X^{-2})$, i.e., $(1 - p)X^2 = O(1)$. Since $|\mathrm{Exp}_{\lambda,G}(p, q)| = X - Y \le 
X - C_1$ for $(p, q) < (1, 0)$ sufficiently close to $(1, 0)$, (29) indeed holds. 

Proof: Part III 

In this subsection we consider the remaining possibilities for $\operatorname{dom} G$. Let us 
define the left-stripped lexicographic ALN (${}^{\dagger}\mathfrak{L}$ALN for brief) as the lexicographic 
ALN except that: 

• its parameter is a forecast-continuous kernel $K$ on ${}^{\dagger}\mathfrak{L} \times X$, where the left- 
stripped lexicographic square 

$${}^{\dagger}\mathfrak{L} := (0, 1] \times [0, 1]$$

is equipped with the order and topology inherited from $\mathfrak{L}$; 

• it outputs a root $(p, q) \in {}^{\dagger}\mathfrak{L}$ of the equation $S_n(p, q) = 0$ as $(p_n, q_n)$; if 
this equation does not have roots in ${}^{\dagger}\mathfrak{L}$, we set $p_n := 1$ and set $q_n \in 
[0, 1]$ arbitrarily (we will make sure that this happens only when $S_n$ is 
everywhere positive). 

In a similar way we define the right-stripped lexicographic square $\mathfrak{L}^{\dagger}$ and the 
right-stripped lexicographic ALN ($\mathfrak{L}^{\dagger}$ALN), which always outputs $(p_n, q_n) \in \mathfrak{L}^{\dagger}$; 
when $S_n(p, q) = 0$ does not have roots $(p, q) \in \mathfrak{L}^{\dagger}$, we now set $p_n := 0$. 

We only consider the case $\operatorname{dom} G = {}^{\dagger}\mathfrak{L}$ (the case $\operatorname{dom} G = \mathfrak{L}^{\dagger}$ is treated 
analogously); this corresponds to $f(C_0) = \infty$ and $g(C_1) < \infty$. Since $S_n$ is con- 
tinuous, the absence of roots of $S_n = 0$ in ${}^{\dagger}\mathfrak{L}$ in conjunction with (27) means that 
$S_n$ is positive everywhere on ${}^{\dagger}\mathfrak{L}$, and so setting $p_n := 1$ in this case guarantees 
that ${}^{\dagger}\mathfrak{L}$ALN still ensures (18). It remains to notice that (28) still holds. 

7 The algorithm 

In this short section we extract the decision strategy achieving (10) from our 
proof of Theorem 1. As we have already noticed (see (25)), 

$$\begin{aligned}
S_n(p, q) &= \sum_{i=1}^{n-1} \left( \mathrm{Exp}_{\lambda,G}(p, q)\,\mathrm{Exp}_{\lambda,G}(p_i, q_i) + \mathbf{K}(x_n, x_i) \right)(y_i - p_i) \\
&\quad + \frac{1}{2} \left( \mathrm{Exp}_{\lambda,G}^2(p, q) + \mathbf{K}(x_n, x_n) \right)(1 - 2p);
\end{aligned} \tag{31}$$

this immediately leads to the following explicit description. 

An algorithm achieving (10) 

Parameters: game with loss function $\lambda$ and canonical choice function $G$; 
kernel $\mathbf{K}$ on $X$ 
FOR $n = 1, 2, \ldots$: 
Read $x_n \in X$. 

Define $S_n(p, q)$ by (31) for all $(p, q) \in \mathfrak{L}$ for which $G(p, q)$ is defined. 
Define $(p_n, q_n)$ as any root $(p, q)$ of $S_n(p, q) = 0$; 
if there are no roots, 

set $p_n := (1 + \operatorname{sign} S_n)/2$ and set $q_n$ to any number in $[0, 1]$. 
Set $\gamma_n := G(p_n, q_n)$. 
Read $y_n \in \{0, 1\}$. 
END FOR. 

(We saw in the previous section that $\operatorname{sign} S_n$ is well defined and is $-1$ or $1$ in 
this context.) 

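The canonical choice functions for the three examples of games given earlier 
are as follows: $G(p, q) = p$ for the square loss and log loss games, and 

$$G(p, q) = \begin{cases} 0 & \text{if } p < 1/2, \\ q & \text{if } p = 1/2, \\ 1 & \text{if } p > 1/2 \end{cases} \tag{32}$$

for the absolute loss game. 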
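The following is a minimal sketch of the algorithm for the square loss game, not a definitive implementation. It assumes $\lambda(y, \gamma) = (y - \gamma)^2$ with $G(p, q) = p$, so that $\mathrm{Exp}_{\lambda,G}(p, q) = 1 - 2p$ and $q_n$ plays no role, real-valued data with a Gaussian kernel, and bisection as the root finder. As a simplification, an endpoint is output whenever $S_n(0) \le 0$ or $S_n(1) \ge 0$, even if interior roots exist; this still keeps the increment $2(y_n - p_n)S_n(p_n)$ nonpositive. All function names are hypothetical.

```python
import math

def gaussian_kernel(x, z):
    # Assumed example kernel on the real line; K(x, x) = 1.
    return math.exp(-(x - z) ** 2)

def exp_sq(p):
    # Exp_{lambda,G}(p, q) for the square loss with G(p, q) = p:
    # (1 - p)^2 - p^2 = 1 - 2p.
    return 1.0 - 2.0 * p

def make_S(x_n, history, kernel):
    # Builds S_n from (31); history holds the past triples (x_i, p_i, y_i).
    a = sum(exp_sq(p_i) * (y_i - p_i) for _, p_i, y_i in history)
    b = sum(kernel(x_n, x_i) * (y_i - p_i) for x_i, p_i, y_i in history)
    k_nn = kernel(x_n, x_n)

    def S(p):
        e = exp_sq(p)
        return a * e + b + 0.5 * (e * e + k_nn) * (1.0 - 2.0 * p)
    return S

def forecast(S, iters=60):
    # Endpoint rule (simplified), otherwise bisection for a root of S.
    if S(0.0) <= 0.0:
        return 0.0
    if S(1.0) >= 0.0:
        return 1.0
    lo, hi = 0.0, 1.0  # invariant: S(lo) > 0 > S(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if S(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def run(xs, ys, kernel=gaussian_kernel):
    # Plays the protocol; the decision is gamma_n = G(p_n, q_n) = p_n.
    history = []
    for x_n, y_n in zip(xs, ys):
        p_n = forecast(make_S(x_n, history, kernel))
        history.append((x_n, p_n, y_n))
    return history
```

Calling `run(xs, ys)` on lists of data and binary observations returns the triples $(x_n, p_n, y_n)$, from which inequality (21) can be checked directly. Each round costs time linear in the number of past rounds, since the sums in (31) are recomputed from scratch.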

8 Directions of further research 

In this section we discuss informally what we consider to be interesting directions 
of further research. 

Non-convex games 

Theorem 1 assumes that the superdecision set is convex. The assumption of 
convexity is convenient but not indispensable. We will only discuss the simplest 
non-convex game. 

The loss function for the simple loss game is the same as for the absolute 
loss game, $\lambda(y, \gamma) = |y - \gamma|$, but $\Gamma = \{0, 1\}$. Now the approach we have used 
in this paper does not work: since $\Gamma$ consists of two elements, there is no non- 
trivial continuous choice function $G : \mathfrak{L} \to \Gamma$ (every continuous image of $\mathfrak{L}$ is 
connected: [5], Theorem 6.1.4). 

A natural idea (U) is to allow Decision Maker to use randomization. The 
expected loss of a strategy making decision 1 with probability $\gamma$ and 0 with 
probability $1 - \gamma$ is $|y - \gamma|$, where $y$ is the actual observation; therefore, for the 
simple loss game a randomized decision strategy can guarantee the following 
analogue of (10): 




(32) 




n=l 



n=l 
where $\mathbb{E}$ refers to the strategy's internal randomization (the decision rules $D$ 
can be allowed to take values in $[0, 1]$). 

The disadvantage of (33) is that typically we are interested in the strat- 
egy's actual rather than expected loss. Our derivation of (33) shows the role 
of randomization: with our choice function (32) no randomization is required 
unless $p = 1/2$. Typically, we rarely find ourselves in a situation of complete 
uncertainty, $p_n = 1/2$; therefore, only a little bit of randomization is needed, 
essentially for tie breaking. The actual loss will be very close to the expected 
loss. It would be interesting to derive formal statements along these lines. 

Non-binary observations 

It would also be interesting to extend this paper's results to more general ob- 
servation spaces (first of all, to carry them over to least-squares regression and 
multi-class classification). The two apparent obstacles to such extensions are 
that the fundamental equality (23) looks tailored to the binary case $y \in \{0, 1\}$ 
and that Lemma 1 ceases to be obvious outside the binary case. However, 
(23) only states that $\lambda(p, \gamma)$ is the game-theoretic 
expected value of $\lambda(y, \gamma)$ (and that reproducing $\lambda(y, \gamma)$ given $\lambda(p, \gamma)$ can be 
accomplished by buying $\lambda(1, \gamma) - \lambda(0, \gamma)$ tickets paying $y$ and costing $p$ each). 
Similar equalities hold for many other forecasting protocols. And an analogue 
of Lemma 1 for a wide class of forecasting protocols is proved in [26]. 

Optimality 

An important problem is to investigate the optimality of our algorithm, de- 
scribed in §7: is the bound (10) tight? (The tightness of the bounds in Theorem 
2 and Equation (20) is established in [25].) 